SUMMARY



The problem investigated in this research is that of the non- 
independent error component introduced into scores on multiple-choice 
tests due to chance guessing success. This class of error operates 
somewhat differently than the random, independent error described in 
classical test theory. It differs in two important aspects. First, it 
operates unidirectionally, in that it can only increase an individual's 
score on a test; secondly, the usual assumption of independence of the 
error component of a score cannot be made in this case. This is because
such error is negatively correlated with an individual's true score, and 
positively correlated with the same class of error on parallel forms of 
the test. Since this type of error violates the assumptions made in the 
derivation of many of the equations of classical test theory, new 
equations must be derived to adequately describe the effects of this 
error as reflected in an individual's score. 

The method used to investigate the effects of non-independent error 
is similar to that which has been employed in the past to investigate other 
classes of error. Basic models are adopted and equations are derived from 
these models. The major difference in procedure for the purpose of this 
research is that the assumption of independence is not made. Further, the 
derived equations are validated against the results of computer simulation 
techniques. These techniques begin with a prepared distribution of true 
scores. An IBM 1620 is then programmed to generate one or more classes of 
error corresponding to a particular true score. The sum of the true score 
and the error components is then representative of an observed score on a 
test. This procedure is repeated to generate data that is a simulation of 
results on parallel forms of a test. The resultant data is then subjected 
to statistical analysis to obtain descriptive quantities of the scores and 
their components. Finally, the equations that have been derived to 
describe the effects of the particular class or classes of error generated 
by the program are validated against the results. 

Three distinct problems were investigated in this research. The first 
problem is that of the dependence of test reliability upon the hetero- 
geneity of the group tested. In classical theory, this relationship is 
described as

$$\rho_{oo_B} = 1 - \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\,(1 - \rho_{oo_A})$$

In this equation, the reliability of some test administered to a 
particular group, A , is related to the reliability of the same test 
administered to another group, B; the quantities in the equation are 
related to group A or group B by subscripts. A cursory examination of 
the equation reveals that the reliabilities will be the same if the 
variances are the same, thus reducing the ratio of the variances to 1. 
The derivation of this equation is based on certain assumptions concerning
error variance that do not necessarily hold for certain classes of error. 

One such class of error is the non-independent error under consideration. 
Another class of error that does not allow such assumptions is the error 
connected with item sampling as described by Lord (1955) . A general 
equation is derived to describe this relationship, and the equation 
indicates that the classical equation is valid if and only if the average 
individual error variance is the same for the two groups. 

The second problem investigated is the applicability of coefficient 
alpha and other reliability estimates to testing situations in general and 
specifically to testing situations where non-independent error may be 
present. It is determined that the necessary and sufficient conditions
for coefficient alpha to equal test reliability are the same regardless of
the class or classes of error in operation. It is shown that coefficient 
alpha is equal to the reliability of a test if and only if, as the number 
of measurements increases without limit, the mean over persons of the 
variance over part-tests of the mean part-test scores over repeated 
measurements is equal to the variance over part-tests of the means over 
persons of the mean part-test scores over repeated measurements. Further, 
the Kuder-Richardson formula 20 (Kuder & Richardson, 1937) is equal to the 
reliability of a test if and only if all test items satisfy this same 
condition. Also, the Kuder-Richardson formula 21 is equal to the reliability 
of a test if and only if all test items satisfy this condition and if in 
addition, both of the quantities are zero. It is further shown that the 
conditions under which the reliability of a composite test of N parts is equal to
coefficient alpha are identical to the conditions under which the reliability of a
test lengthened N times is given by the Spearman-Brown formula. An 
equation describing test reliability when these conditions are not met is 
derived. These results support recent findings by Novick & Lewis (1967). 

The final problem considered is that of the dependence of test 
reliability on the number of alternatives per item. It has been observed 
that as the number of alternatives per item increases, test reliability 
likewise increases (Lord, 1944; Carroll, 1945; Plumlee, 1952; Zimmerman & 
Williams, 1965). It has further been suggested that the relationship can 
be described by the Spearman-Brown formula, which was originally developed 
to describe increase in test reliability with increase in test length 
(Remmers, Karslake & Gage, 1940). It is found that increase in test
reliability with increase in number of alternatives is indicated only
approximately by the Spearman-Brown formula. An equation is derived that
indicates the relationship much more accurately.

The results of this study are probably not directly applicable to
the solution of the day-to-day problems of testing; however, the findings 
do provide a solid foundation for the understanding of the operation of 
this particular class of error. Due to the widespread use of multiple- 
choice tests, it is probable that non-independent error due to chance 
guessing success is present to a large degree in the results of the 
majority of tests used in the field of education. An understanding of this
class of error is therefore highly desirable. It is also important in the 
development of test theory that one understand the differential operation 
of the different classes of error. Hopefully this research will serve not 
only as a base but also as a stimulation for further research in this area. 

INTRODUCTION



Traditional mental test theory has been largely founded on the 
assumption that errors of measurement are independent random variables 
with an expected value of zero. When these assumptions are made, certain 
intercorrelational terms relating the theoretical components of test 
scores are zero, and resultant equations are considerably simplified. 



Two basic models have been employed to derive equations in test 
theory; however, since the assumption of independence has been made in 
the use of both models, similar results have been obtained. The first 
model is based on a definition of an observed score on a test as the 
sum of two theoretical components, true score and an error component. 

This relationship is expressed as O = T + E. The alternate model defines
true score as a limit:

$$T_j = \lim_{H \to \infty} \frac{\sum_{g=1}^{H} O_{gj}}{H} - \lim_{H \to \infty} \frac{\sum_{g=1}^{H} E_{gj}}{H}$$

where $T_j$ represents the true score for some individual j, $O_{gj}$ represents
the observed score of that individual on the g-th form of H parallel forms
of a test, and $E_{gj}$ represents the error component of that observed score.
Since, in the limit, the mean of the error components for this individual
over repeated testings is zero, the equation simplifies to a definition
of an individual's true score as the limiting value of the mean of the
frequency distribution of his observed scores over repeated testings
with parallel forms of a test (Gulliksen, 1950).



As an example, independence will be assumed under the specifications 
of the first model. Such an assumption implies that the magnitude and 
direction of the error component is in no way influenced by the magnitude 
of its true score counterpart. It follows from this implication that the 
correlation between true scores and corresponding error components is 

zero, or $r_{te} = 0$; it further follows that the correlation between error
components on parallel forms of a test is zero, or $r_{ee} = 0$. With this
conclusion, it is relatively easy to develop the classical equations for
the reliability of a test:

$$r_{oo} = \frac{\sigma_t^2}{\sigma_o^2} \quad \text{and} \quad r_{oo} = 1 - \frac{\sigma_e^2}{\sigma_o^2}$$

where $r_{oo}$ is reliability, $\sigma^2$ is variance, and the subscripts o, t, and
e refer to observed scores, true scores, and error components respectively.
Other equations of classical test theory can likewise be derived with ease
once the assumption of independence has been made.

Many tests presently used are of the multiple-choice type, and it
has been widely recognized that scores on such measuring instruments
will reflect to some extent chance guessing success. Such chance guessing
success is thus a class of error component of the score. It has not been
recognized as widely that this class of error is non-independent. The
relationship of this class of error to true score can be seen in a brief
example. An individual with a true score of ten on a ten-item true-false
test is faced with no items on which to guess the answer. The error
component of his observed score as a result of chance guessing success must
therefore be zero. Another individual with a true score of zero on the
same test may guess on all items, and the error component of his observed
score can range from zero to ten, with an expected value of five. Stated
generally, the lower an individual's true score on a multiple-choice test,
the more items are available for which he can guess the answer; and the
greater the number of items on which he guesses, the greater is his
expected number of successful guesses. From this it follows that the
correlation existing between true score and this class of error is a
negative one. It also follows that such error components will be
positively correlated with like error components on parallel forms of the
test. A further characteristic of this class of error is that it
operates unidirectionally. That is, such error can only increment an
individual's score.
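
The ten-item example can be checked with simple arithmetic. The following is a minimal sketch and not part of the original study: the function name and its parameters are assumptions introduced here, and the examinee is assumed to guess blindly on every item he does not know.

```python
from fractions import Fraction


def guessing_error_summary(true_score, n_items, n_alternatives):
    """Range and expected value of the chance-success error for one examinee.

    Assumes (for illustration) one blind guess on each unknown item, each
    guess succeeding with probability 1 / n_alternatives.
    """
    unknown_items = n_items - true_score      # items left to guess on
    p_success = Fraction(1, n_alternatives)   # chance of a lucky guess
    return {
        "min_error": 0,                        # this error never lowers a score
        "max_error": unknown_items,
        "expected_error": unknown_items * p_success,
    }


# Ten-item true-false test (two alternatives per item).
for true_score in (0, 5, 10):
    print(true_score, guessing_error_summary(true_score, 10, 2))
```

The expected error shrinks as the true score rises, which is the source of the negative correlation between true score and this class of error.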

Due to the non-independent nature of this class of error and to its
unidirectional character, the effects of such error cannot be effectively
described by the equations of the classical test theory. A considerable
amount of research has been conducted recently to examine some of the
basic properties of this class of error (Burkheimer, 1965; Burkheimer,
Zimmerman & Williams, 1967; Williams & Zimmerman, 1966; Zimmerman &
Williams, 1965, 1967; Zimmerman, Williams & Rehm, 1966). The present
research is an extension of this ongoing investigation.

METHODS



The investigation into the properties of non-independent error
proceeded along two lines. The first line was the development of models
from which equations describing the effects of this class of error could 
be derived; the second was the development of computer simulation programs 
to generate data against which the derived equations could be validated. 



In the first stage of the research, the basic models of classical 
test theory, as described above, were used as a foundation for equation 
derivation. The major difference in terms of deriving equations from these 
models in this research was the fact that the assumptions of independence
and zero-value expectancy were not made. Early results indicated that the 
classical equations for reliability could be stated in slightly modified 
form as follows: 
where a is the number of alternatives per item and $r_{ee}$ is the correlation
existing between non-independent error components. Other basic

equations developed from these models likewise differed from the analogous 
equations of classical test theory. Some of the effects of non-independent 
error, however, led to results that were identical to those previously 
described in classical theory (e.g., the Spearman-Brown formula; Zimmerman
& Williams, 1966). 



An interesting aspect of the derived equations is that they easily
reduce to the classical equations when the number of alternatives increases 
without limit, thus reducing the probability of a successful guess, 1/a, 
to a limiting value of zero and eliminating this particular class of error. 




reduces to the classical equation. Considering the second equation above
together with the equation for $r_{ee}$,
it is obvious that the limit, as $a \to \infty$, of the term $(1 - r_{ee})$ is 1, and
again the equation reduces to the classical equation.



In the later stages of the research, an additional model was utilized. 
This model resembles the second model described above in that it considers 
the frequency distribution of scores over repeated measurements on parallel 
forms of a test and in that it considers the parameters of this distribution 
as the number of repeated measurements increases without limit. The model 
differs from the classical model, however, in that it does not introduce
the theoretical concepts of true score and error components. The model,
which is considered in greater detail in Parts I and II of the following
section, considers the mean of a distribution of observed scores for some
individual j, $\bar{O}_j$, and the variance of this distribution, $(\sigma_o^2)_j$. Since
no assumptions are made under this model of either independence or
non-independence, the model is quite general, and the results obtained from
the model are applicable under either assumption.



A computer simulation technique which had been used quite effectively 
previously was further utilized in the present research. The first 
program developed was a duplication of the program previously described 
(Zimmerman & Williams, 1965). An IBM 1620 was used to generate scores on
a "multiple-choice test" with the probability of a successful guess set
at a value equal to the reciprocal of the number of alternatives per item 
on "tests" of varying length. 



The procedure was simple in concept. A hypothetical distribution of 
true scores was prepared, and for each true score T on a "test" of N items,
the computer simulated N-T guessing trials. "Correct guesses" were 

summed, and this sum represented the non-independent error component of 
an observed score, which was obtained by adding the generated error 
component to the respective true score. This procedure was repeated to 
generate results comparable to repeated testings on parallel forms of a 
test. Finally, the intercorrelation matrix of all possible combinations 
of observed scores, true scores, and error components was obtained. With 
these results, the equations derived under the assumption of non- 
independence were validated. 
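
The procedure just described can be sketched in a modern language. This is not the IBM 1620 program: the names (`simulate_form`, `pearson`), the seed, the test length, the number of alternatives, and the prepared true-score distribution are all assumptions chosen for illustration.

```python
import random


def pearson(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_form(true_scores, n_items, n_alternatives, rng):
    """Simulate one form: N - T guessing trials per person, O = T + E."""
    p_success = 1.0 / n_alternatives
    errors = [
        sum(rng.random() < p_success for _ in range(n_items - t))
        for t in true_scores
    ]
    observed = [t + e for t, e in zip(true_scores, errors)]
    return observed, errors


rng = random.Random(1620)                      # arbitrary seed
n_items, n_alternatives = 50, 4                # hypothetical "test"
true_scores = [rng.randint(10, 40) for _ in range(2000)]  # prepared distribution

# Two passes over the same true scores stand in for two parallel forms.
obs_1, err_1 = simulate_form(true_scores, n_items, n_alternatives, rng)
obs_2, err_2 = simulate_form(true_scores, n_items, n_alternatives, rng)

r_te = pearson(true_scores, err_1)   # true score vs. error, same form
r_ee = pearson(err_1, err_2)         # error vs. error, parallel forms
r_oo = pearson(obs_1, obs_2)         # observed vs. observed (reliability)
print(f"r_te={r_te:.3f}  r_ee={r_ee:.3f}  r_oo={r_oo:.3f}")
```

With settings like these the sketch should show a negative correlation between true scores and errors and a positive correlation between errors on the two forms, which is the pattern the derived equations are meant to describe.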



The second program was an extension of the first, and all steps 
through the generation of a non-independent error component were the same. 
In this program, however, an additional error component was generated. 

The second error component was obtained by a random sampling technique in 
such a way that the resulting error component complied with the assumptions 
of independence. Observed score was obtained as before with the exception
that it now consisted of the sum of a true score, an independent error
component, and a non-independent error component. An intercorrelation 
matrix was obtained as before. The results of this program were used to 
validate the equations previously derived for the case in which several 
classes of error components are reflected in an individual's score on a 
test (Zimmerman, Williams & Rehm, 1966) . 

An additional program was placed in operation, which generated error
components of scores of the item-sampling class described by Lord (1955).
The final results of this program became available only recently and have 
not yet been fully analyzed. 



RESULTS 



Part I: Dependence of Test Reliability Upon Heterogeneity of
Individual and Group Score Distributions 

In the classical test theory the equation

$$\rho_{oo_B} = 1 - \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\,(1 - \rho_{oo_A}) \qquad (1)$$

relates the reliability coefficient of a test obtained in a group, A,
with a certain variance of observed scores, $\sigma_{o_A}^2$, to the reliability
coefficient of that test obtained in a group, B, with a different
variance of observed scores, $\sigma_{o_B}^2$ (Gulliksen, 1950). Derivation of the
equation depends upon equating error variance in the two groups. In
the classical theory the variance of the distribution of test scores for a
given person over repeated measurements is regarded as the same for all
persons and also the same as total error variance for a group of persons.
The purpose of this paper is to determine the necessary and sufficient
conditions under which the above equation is valid and to examine modifications
of the equation which are obtained when these conditions are not met.
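
A small numerical illustration of the classical relation may help. This sketch assumes the ratio-of-variances form of the relation described in the Summary; the function name and the numbers are hypothetical.

```python
def classical_reliability_in_group_b(rel_a, var_a, var_b):
    """Classical relation: rho_B = 1 - (var_A / var_B) * (1 - rho_A).

    rel_a        -- reliability of the test in group A
    var_a, var_b -- observed-score variances in groups A and B
    """
    return 1.0 - (var_a / var_b) * (1.0 - rel_a)


# Equal variances leave the reliability unchanged.
same_spread = classical_reliability_in_group_b(0.80, 25.0, 25.0)

# A more heterogeneous group B (larger variance) gives a higher value.
wider_spread = classical_reliability_in_group_b(0.80, 25.0, 50.0)

print(same_spread, wider_spread)
```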



Definitions

The subscript g, taking on values from 1 to H, will refer to
repeated measurements, and the subscript j, taking on values from 1 to
K, will refer to persons. We consider the distribution of observed
scores, $O_{gj}$, for person j, over H independent repeated measurements.
We consider the distribution of observed scores, $O'_{gj}$, for this same
person j over H independent repeated parallel measurements. Finally,
we consider the means, $\bar{O}_j$ and $\bar{O}'_j$, and the variances, $(\sigma_o^2)_j$ and
$(\sigma_{o'}^2)_j$, of the distributions of scores for person j, where

$$\bar{O}_j = \lim_{H \to \infty} \frac{\sum_{g=1}^{H} O_{gj}}{H},$$

where

$$(\sigma_o^2)_j = \lim_{H \to \infty} \frac{\sum_{g=1}^{H} O_{gj}^2}{H} - \bar{O}_j^2,$$

and where similar expressions can be written in prime notation.
Test reliability is defined as follows:

$$\rho_{oo} = \lim_{H \to \infty} \frac{\dfrac{\sum_{g=1}^{H}\sum_{j=1}^{K} O_{gj} O'_{gj}}{HK} - \dfrac{\sum_{g=1}^{H}\sum_{j=1}^{K} O_{gj}}{HK} \cdot \dfrac{\sum_{g=1}^{H}\sum_{j=1}^{K} O'_{gj}}{HK}}{\sigma_o \sigma_{o'}} \qquad (2)$$

where

$$\sigma_o^2 = \lim_{H \to \infty} \left[ \frac{\sum_{g=1}^{H}\sum_{j=1}^{K} O_{gj}^2}{HK} - \left( \frac{\sum_{g=1}^{H}\sum_{j=1}^{K} O_{gj}}{HK} \right)^2 \right]$$

and where $\sigma_{o'}^2$ is defined by a similar expression in prime notation.
That is, reliability as determined for these K persons is the product-moment
correlation between pairs of repeated parallel measurements, as
the number of pairs increases without limit.

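The frequency distributions of $O_{gj}$ and $O'_{gj}$ over repeated measurements
for any person j are assumed to be the same, such that $\bar{O}_j = \bar{O}'_j$
and $(\sigma_o^2)_j = (\sigma_{o'}^2)_j$. It is not assumed, however, that such distributions
are the same for different persons. It follows from the above that
$\sigma_o = \sigma_{o'}$, $\sigma_{\bar{O}_j} = \sigma_{\bar{O}'_j}$, and $\rho_{\bar{O}_j \bar{O}'_j} = 1$, where $\sigma_{\bar{O}_j}$ and $\sigma_{\bar{O}'_j}$ are
the standard deviations of the $\bar{O}_j$ and $\bar{O}'_j$ values over all K persons and
$\rho_{\bar{O}_j \bar{O}'_j}$ is the correlation between $\bar{O}_j$ and $\bar{O}'_j$ for these K persons.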

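We can consider the covariance term of (2) as the sum of two
quantities,

$$\rho_{oo}\,\sigma_o \sigma_{o'} = \frac{\sum_{j=1}^{K} \rho_{o_j o'_j} (\sigma_o)_j (\sigma_{o'})_j}{K} + \rho_{\bar{O}_j \bar{O}'_j}\,\sigma_{\bar{O}_j} \sigma_{\bar{O}'_j} \qquad (3)$$

Since the pairs of repeated parallel measurements for person j are
independent, it follows that $\rho_{o_j o'_j} (\sigma_o)_j (\sigma_{o'})_j$, the covariance of
$O_{gj}$ and $O'_{gj}$ for person j, is zero for all K persons. Therefore,
$\rho_{oo}\,\sigma_o \sigma_{o'} = \rho_{\bar{O}_j \bar{O}'_j}\,\sigma_{\bar{O}_j} \sigma_{\bar{O}'_j}$. From the equalities established
above, together with a standard theorem,

$$\sigma_o^2 = \overline{(\sigma_o^2)_j} + \sigma_{\bar{O}_j}^2, \qquad \overline{(\sigma_o^2)_j} = \frac{\sum_{j=1}^{K} (\sigma_o^2)_j}{K},$$

we arrive at the result

$$\rho_{oo} = \frac{\sigma_{\bar{O}_j}^2}{\sigma_o^2} = 1 - \frac{\overline{(\sigma_o^2)_j}}{\sigma_o^2} \qquad (4)$$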
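
Dependence of Reliability Upon Group Heterogeneity

From (4), for any two groups, A and B, we can write

$$\sigma_{o_A}^2 (1 - \rho_{oo_A}) = \left[\overline{(\sigma_o^2)_j}\right]_A \quad \text{and} \quad \sigma_{o_B}^2 (1 - \rho_{oo_B}) = \left[\overline{(\sigma_o^2)_j}\right]_B$$

Subtracting the second of these equations from the first and solving
for $\rho_{oo_B}$, we obtain

$$\rho_{oo_B} = 1 - \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\,(1 - \rho_{oo_A}) + \frac{\left[\overline{(\sigma_o^2)_j}\right]_A - \left[\overline{(\sigma_o^2)_j}\right]_B}{\sigma_{o_B}^2} \qquad (5)$$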
From (5) it is evident that
change in reliability with change in group heterogeneity is given by
(1) if and only if the arithmetic mean, over the persons in group A,
of the variances of observed scores over repeated measurements is
equal to the same quantity for the persons in group B.

In classical test theory $\overline{(\sigma_o^2)_j}$ is identified with total error
variance, a quantity which has been regarded as the same for any two
groups. For models in which this condition does not hold, the
relationship is given by (5). It should be noted that in the above
derivation $\bar{O}_j$ has not been identified with the concept of true score.
Therefore, the conclusions reached hold even if $O_{gj}$ is regarded as the
sum of components which are linearly correlated (Zimmerman & Williams,
1965, 1966). We will now consider two models in which the frequency
distribution of observed scores over repeated measurements varies from
person to person.
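
The general relation can be illustrated numerically. The sketch below is hedged: it assumes that reliability in each group equals one minus the ratio of the mean within-person variance to the total observed variance, which is the relation the derivation above rests on, and the function names and figures are hypothetical.

```python
def reliability_from_variances(total_var, mean_within_person_var):
    """rho = 1 - (mean within-person variance) / (total observed variance)."""
    return 1.0 - mean_within_person_var / total_var


def general_reliability_in_group_b(rel_a, var_a, var_b,
                                   mean_within_a, mean_within_b):
    """Classical relation plus a correction for unequal within-person variance."""
    classical = 1.0 - (var_a / var_b) * (1.0 - rel_a)
    correction = (mean_within_a - mean_within_b) / var_b
    return classical + correction


# Hypothetical groups: A has total variance 25 and mean within-person
# variance 5; B has total variance 50 and mean within-person variance 8.
rel_a = reliability_from_variances(25.0, 5.0)
rel_b_direct = reliability_from_variances(50.0, 8.0)
rel_b_general = general_reliability_in_group_b(rel_a, 25.0, 50.0, 5.0, 8.0)

# With equal mean within-person variances the correction vanishes and the
# classical value is recovered.
rel_b_classical = general_reliability_in_group_b(rel_a, 25.0, 50.0, 5.0, 5.0)

print(rel_a, rel_b_direct, rel_b_general, rel_b_classical)
```

In this example the classical relation alone would overstate the reliability in group B, because the mean within-person variance is larger in B than in A.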



Item Sampling Model 

Following the model presented by Lord (1955) for sampling of test
items from a population of items, we can write

$$(\sigma_o^2)_j = \frac{\bar{O}_j (N - \bar{O}_j)}{N},$$

where N is the number of items. Lord identifies the quantity $\bar{O}_j$ with
true score, $T_j$, representing the product of the proportion of items in
the population which are known and the number of test items. For any
two groups, A and B, we have

$$\left[\overline{(\sigma_o^2)_j}\right]_A = \frac{\mu_{o_A}(N - \mu_{o_A}) - \sigma_{\bar{O}_{jA}}^2}{N} \quad \text{and} \quad \left[\overline{(\sigma_o^2)_j}\right]_B = \frac{\mu_{o_B}(N - \mu_{o_B}) - \sigma_{\bar{O}_{jB}}^2}{N},$$

where $\mu_{o_A}$ and $\mu_{o_B}$ are the means over all K persons and all H
measurements of the $O_{gjA}$ and $O_{gjB}$ values. Substituting these results in (5)
and factoring, we obtain
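
$$\rho_{oo_B} = 1 - \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\,(1 - \rho_{oo_A}) + \frac{(\mu_{o_A} + \mu_{o_B} - N)(\mu_{o_B} - \mu_{o_A}) - (\sigma_{\bar{O}_{jA}}^2 - \sigma_{\bar{O}_{jB}}^2)}{N \sigma_{o_B}^2} \qquad (6)$$

Only if the right-hand term is zero is (1) valid. Otherwise, we can
substitute in (6) for $\sigma_{\bar{O}_{jA}}^2$ its equivalent $\rho_{oo_A} \sigma_{o_A}^2$ and for $\sigma_{\bar{O}_{jB}}^2$ its
equivalent $\rho_{oo_B} \sigma_{o_B}^2$, solve for $\rho_{oo_B}$ and simplify to obtain

$$\rho_{oo_B} = \frac{N}{N-1}\left(1 - \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\right) + \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\,\rho_{oo_A} + \frac{(\mu_{o_A} + \mu_{o_B} - N)(\mu_{o_B} - \mu_{o_A})}{(N-1)\,\sigma_{o_B}^2} \qquad (7)$$

It should be noted that the right hand term becomes zero if $\mu_{o_A} = \mu_{o_B}$
or if $\mu_{o_A} = (N - \mu_{o_B})$, in which case (7) reduces to

$$\rho_{oo_B} = \frac{N}{N-1}\left(1 - \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\right) + \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\,\rho_{oo_A}.$$

Further, if the right hand term is not zero, but $\sigma_{o_A}^2 = \sigma_{o_B}^2$, (7) reduces to

$$\rho_{oo_B} = \rho_{oo_A} + \frac{(\mu_{o_A} + \mu_{o_B} - N)(\mu_{o_B} - \mu_{o_A})}{(N-1)\,\sigma_{o_B}^2}$$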



Chance Success Model 



Using the model for chance success presented previously (Zimmerman
& Williams, 1965; Burkheimer, Zimmerman & Williams, 1967) we can write
$(\sigma_o^2)_j = p(N - \bar{O}_j)$, where p, the probability of chance success, is
assumed to be the same for all persons. Finding the arithmetic mean
for groups A and B as before gives $\left[\overline{(\sigma_o^2)_j}\right]_A = p(N - \mu_{o_A})$ and
$\left[\overline{(\sigma_o^2)_j}\right]_B = p(N - \mu_{o_B})$. Substituting these results in (5) we have

$$\rho_{oo_B} = 1 - \frac{\sigma_{o_A}^2}{\sigma_{o_B}^2}\,(1 - \rho_{oo_A}) + \frac{p\,(\mu_{o_B} - \mu_{o_A})}{\sigma_{o_B}^2}$$

Only under the conditions that $\mu_{o_A} = \mu_{o_B}$ or that p = 0 is (1)
obtained.
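
The within-person variance of the chance success model can be verified exactly. The sketch below is illustrative and assumes, as in the simulation procedure of the Methods section, that a person with true score T makes N - T independent guesses, each succeeding with probability p; the function name is an assumption introduced here.

```python
from fractions import Fraction


def chance_model_moments(true_score, n_items, p):
    """Mean and variance of a person's observed score over repeated forms.

    Observed score = true score + number of successful guesses, where the
    guesses are assumed to be N - T independent trials with success chance p.
    """
    n_guesses = n_items - true_score
    mean_observed = true_score + n_guesses * p    # expected observed score
    var_observed = n_guesses * p * (1 - p)        # binomial variance
    return mean_observed, var_observed


n_items = 20
p = Fraction(1, 4)          # four alternatives per item
for true_score in range(0, n_items + 1, 5):
    mean_observed, var_observed = chance_model_moments(true_score, n_items, p)
    # The variance equals p * (N - mean observed score) for every person.
    print(true_score, mean_observed, var_observed, p * (n_items - mean_observed))
```

Because the variance depends on the person's mean observed score, it differs from person to person, which is why the classical relation holds for this model only when the group means are equal or p is zero.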

Part II: Conditions Under Which Coefficient Alpha Equals Test
Reliability: the Case of Heterogeneous Score Distributions
and Correlated Score Components

Since the original derivation by Kuder and Richardson (1937) of
the formula which Cronbach designated as coefficient alpha, the
assumptions required have been more fully explicated (Jackson & Ferguson,
1941; Guttman, 1945; Gulliksen, 1950; Cronbach, 1951; Novick & Lewis,
1967). Derivations have been based on the classical test theory model
in which true scores and error scores are uncorrelated and error scores
on parallel tests are uncorrelated. The present derivation of
coefficient alpha is somewhat different from those which have appeared in
the literature, reveals the necessary and sufficient conditions under
which this quantity is equal to the reliability of the test, and
indicates that the formula is valid under certain conditions where the
classical test theory model is not applicable.

Definitions 



The subscript i, taking on values from 1 to N, will refer to the
parts of a test.

We consider the distributions of observed part-test scores,
$O_{gji}$, and total test scores, $O_{gj}$, for person j over H independent
repeated measurements. We further consider the distributions of observed
part-test scores, $O'_{gji}$, and total test scores, $O'_{gj}$, for this same
person j over H independent repeated parallel measurements. Also, we
consider the quantities $\bar{O}_j$ and $(\sigma_o^2)_j$, which have been defined above.
Analogous quantities, $\bar{O}_{ij}$ and $(\sigma_{o_i}^2)_j$, define the mean and variance of
part-test scores for person j over repeated measurements.

The reliability of a test is defined as above. The reliability
of a part-test, $(\rho_{oo})_i$, is defined in the same way, in terms of $O_{gji}$
and $O'_{gji}$.



It is assumed that, as H increases without limit, the frequency
distributions of $O_{gj}$ and $O'_{gj}$ are the same for any person j, such that
$\bar{O}_j = \bar{O}'_j$ and $(\sigma_o^2)_j = (\sigma_{o'}^2)_j$. It is not assumed that such distributions
are the same for different persons (cf. Zimmerman, Williams &
Burkheimer, 1968). The same concept applies to the part-test scores,
$O_{gji}$ and $O'_{gji}$. It follows from the above that $\sigma_o = \sigma_{o'}$, $\sigma_{\bar{O}_j} = \sigma_{\bar{O}'_j}$,
and $\rho_{\bar{O}_j \bar{O}'_j} = 1$. Likewise, these relationships hold for the part-test
scores.



Another expression for test reliability is given in (4) above,
and an expression for the reliability of part tests can be written as

$$(\rho_{oo})_i = 1 - \frac{\overline{(\sigma_{o_i}^2)_j}}{\sigma_{o_i}^2}$$

Here, the terms used are analogous to their counterparts defined above,
where the subscript i indicates that they refer only to the i-th part of
the test.

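The following symbols will also be used:

$$\bar{\bar{O}}_j = \frac{\sum_{i=1}^{N} \bar{O}_{ij}}{N} \quad \text{and} \quad \bar{\bar{O}}_i = \frac{\sum_{j=1}^{K} \bar{O}_{ij}}{K}$$

$\sigma_{\bar{\bar{O}}_j}^2$ is the variance of $\bar{\bar{O}}_j$ over all K persons.

$\sigma_{\bar{\bar{O}}_i}^2$ is the variance of $\bar{\bar{O}}_i$ over all N part-tests.

$(\sigma_{\bar{O}_{ij}}^2)_j$ is the variance of $\bar{O}_{ij}$ for person j over all N part-tests.

$\sigma_{\bar{O}_{ij}}^2$ is the variance of $\bar{O}_{ij}$ over all K persons and all N part-tests.

$\rho_{o_n o_m} \sigma_{o_n} \sigma_{o_m}$ is the covariance of $O_{gji}$ for any two part-tests, n and m.

$\rho_{\bar{O}_{jn} \bar{O}_{jm}} \sigma_{\bar{O}_{jn}} \sigma_{\bar{O}_{jm}}$ is the covariance of $\bar{O}_{ij}$ for any two part-tests, n and m.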



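Derivation of Coefficient Alpha

Total observed variance can be written as follows:

$$\sigma_o^2 = \sum_{i=1}^{N} \sigma_{o_i}^2 + \sum_{n=1}^{N} \sum_{\substack{m=1 \\ n \neq m}}^{N} \rho_{o_n o_m} \sigma_{o_n} \sigma_{o_m}$$

Variance of $\bar{O}_j$ can likewise be written

$$\sigma_{\bar{O}_j}^2 = \sum_{i=1}^{N} (\sigma_{\bar{O}_{ij}}^2)_i + \sum_{n=1}^{N} \sum_{\substack{m=1 \\ n \neq m}}^{N} \rho_{\bar{O}_{jn} \bar{O}_{jm}} \sigma_{\bar{O}_{jn}} \sigma_{\bar{O}_{jm}}$$

where $(\sigma_{\bar{O}_{ij}}^2)_i$ is the variance of $\bar{O}_{ij}$ for part-test i over all K persons.
Subtracting the second equation from the first yields

$$\sigma_o^2 - \sigma_{\bar{O}_j}^2 = \sum_{i=1}^{N} \left[ \sigma_{o_i}^2 - (\sigma_{\bar{O}_{ij}}^2)_i \right] + \sum_{n=1}^{N} \sum_{\substack{m=1 \\ n \neq m}}^{N} \left[ \rho_{o_n o_m} \sigma_{o_n} \sigma_{o_m} - \rho_{\bar{O}_{jn} \bar{O}_{jm}} \sigma_{\bar{O}_{jn}} \sigma_{\bar{O}_{jm}} \right]$$

Relating the second expression in the right hand member of this
equation to equation (3) and following the same argument, it is evident
that this expression is zero. Substituting $\rho_{oo} \sigma_o^2$ for its equivalent
value $\sigma_{\bar{O}_j}^2$ and reducing, we arrive at

$$\rho_{oo} = 1 - \frac{\sum_{i=1}^{N} \sigma_{o_i}^2 - \sum_{i=1}^{N} (\sigma_{\bar{O}_{ij}}^2)_i}{\sigma_o^2} = 1 - \frac{\sum_{i=1}^{N} \sigma_{o_i}^2 - N\,\overline{(\sigma_{\bar{O}_{ij}}^2)_i}}{\sigma_o^2} \qquad (9)$$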



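Since $\sigma_{\bar{O}_{ij}}^2 = \overline{(\sigma_{\bar{O}_{ij}}^2)_i} + \sigma_{\bar{\bar{O}}_i}^2$ and $\sigma_{\bar{O}_{ij}}^2 = \overline{(\sigma_{\bar{O}_{ij}}^2)_j} + \sigma_{\bar{\bar{O}}_j}^2$, it
follows that $\overline{(\sigma_{\bar{O}_{ij}}^2)_i} = \overline{(\sigma_{\bar{O}_{ij}}^2)_j} + \sigma_{\bar{\bar{O}}_j}^2 - \sigma_{\bar{\bar{O}}_i}^2$. Substituting
this result in (9) gives

$$\rho_{oo} = 1 - \frac{\sum_{i=1}^{N} \sigma_{o_i}^2 - N\,\overline{(\sigma_{\bar{O}_{ij}}^2)_j} - N \sigma_{\bar{\bar{O}}_j}^2 + N \sigma_{\bar{\bar{O}}_i}^2}{\sigma_o^2} \qquad (10)$$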
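From the above definition of $\bar{\bar{O}}_j$, the mean of $\bar{O}_{ij}$ over the N part-tests,
it is evident that

$$N \sigma_{\bar{\bar{O}}_j}^2 = \frac{\sigma_{\bar{O}_j}^2}{N} = \frac{\rho_{oo}\,\sigma_o^2}{N}$$

Substituting in (10) and solving for $\rho_{oo}$ leads to the following result:

$$\rho_{oo} = \frac{N}{N-1} \left[ 1 - \frac{\sum_{i=1}^{N} \sigma_{o_i}^2}{\sigma_o^2} \right] + \frac{N^2}{N-1} \cdot \frac{\overline{(\sigma_{\bar{O}_{ij}}^2)_j} - \sigma_{\bar{\bar{O}}_i}^2}{\sigma_o^2} \qquad (11)$$

which, except for the right hand term, is identical to $\alpha$.
We can write $\alpha = \rho_{oo}$ if and only if $\overline{(\sigma_{\bar{O}_{ij}}^2)_j} = \sigma_{\bar{\bar{O}}_i}^2$; in other words,
coefficient alpha is equal to the reliability of the test if and only
if, as the number of measurements increases without limit, the mean
over persons of the variances over part-tests of the mean part-test
scores over repeated measurements is equal to the variance over
part-tests of the means over persons of the mean part-test scores over
repeated measurements.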
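
For readers who want to compute the quantity under discussion, coefficient alpha from a single administration can be sketched as follows. The function names and the small data set are hypothetical, and population variances (divisor n) are used throughout as an assumption of this sketch.

```python
def variance(values):
    """Population variance (divisor n) of a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / n


def coefficient_alpha(part_scores):
    """Coefficient alpha for part_scores[j][i] = score of person j on part i.

    alpha = N / (N - 1) * (1 - sum of part variances / total-score variance)
    """
    n_parts = len(part_scores[0])
    totals = [sum(row) for row in part_scores]
    part_variances = [
        variance([row[i] for row in part_scores]) for i in range(n_parts)
    ]
    return (n_parts / (n_parts - 1)) * (1 - sum(part_variances) / variance(totals))


# Four persons, three parts scored 0 or 1 (made-up data).
scores = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]
alpha = coefficient_alpha(scores)
print(alpha)
```

Whether this value equals the reliability of the test depends on the condition stated above, which cannot be checked from a single administration alone.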



In the classical test theory model $O_{gj} = T_j + E_{gj}$ and

$$T_j = \lim_{H \to \infty} \frac{\sum_{g=1}^{H} O_{gj}}{H} = \bar{O}_j,$$

where $T_j$ and $E_{gj}$ are true and error components of scores. In the
present derivation the quantity

$$\lim_{H \to \infty} \frac{\sum_{g=1}^{H} O_{gj}}{H}$$

has not been identified with the concept of true score. If, however,
$\bar{O}_j$ is a linear function of $T_j$, the conclusions reached by Novick &
Lewis (1967) hold even when true and error components of scores are
correlated and when the distributions of scores over repeated measurements
are different from person to person (Burkheimer, Zimmerman & Williams, 1967;
Williams & Zimmerman, 1966; Zimmerman & Williams, 1965, 1966). That is,
even in this case the necessary and sufficient condition for coefficient
alpha to equal test reliability is that all part-tests be essentially
τ-equivalent, as defined by Novick & Lewis. We can now write these
conditions in the following form: for every j, $\bar{O}_{nj} = \bar{O}_{mj} + C_{nm}$, for
any n, m; or, that

the mean part-test scores over repeated measurements for person j differ
at most by constants, which must be the same for all K persons, but not 

all 



In other words, the variances of all parts are the same, the
intercorrelations among all parts are the same and equivalent to the
reliability of each part, and the correlation between the $\bar{O}_{ij}$ values is
unity for all pairs of part-tests. Thus, all parts of the test must be
parallel measurements in the usual sense of "parallel" in test theory,
except for differences in scores by a constant.

The present results are consistent with those obtained by
Jackson and Ferguson (1941) and Gulliksen (1950). The above condition
may be expressed by saying that over repeated measurements the average
covariance between parts within a test must equal the average covariance
for each part. Still another way of expressing the condition is this:
a single administration of a test can yield an estimate of reliability
only if the degree of correlation between parts within the test provides
information as to the correlation which would be obtained over repeated
measurements.

When individual items are considered, $\alpha$ is equivalent to the
Kuder-Richardson formula 20. That formula, then, is equal to the
reliability of a test if and only if all test items satisfy the above
conditions.



Equivalence of Coefficient Alpha and
the Spearman-Brown Formula

The conditions under which the reliability of a composite test
of N parts is equal to coefficient alpha are identical to the conditions
under which the reliability of a test lengthened N times is given by
the Spearman-Brown formula. The latter condition is sometimes expressed
by saying that the parts added to the original test must all be parallel.
But this is just the condition required for coefficient alpha to equal
the reliability of a composite test. No requirement need be made in
either case as to the degree of correlation between true and error
components of scores (Zimmerman & Williams, 1966).

In the case in which N refers to items, the following statement
can be made: the conditions under which the reliability of a test of N
items is given by the Kuder-Richardson formula 20 are identical to the
conditions under which the reliability of a one-item test lengthened N
times is given by the Spearman-Brown formula. The relationship can be
seen by letting $(\rho_{oo})_i$ be the reliability of one part-test, or one
item, before the test is lengthened. If the above conditions are met
we can write $\sigma_o^2$ as an expression in terms of part-test variances and
covariances and write

$$\sigma_o^2 = N \sigma_{o_i}^2 \left[ 1 + (N - 1)(\rho_{oo})_i \right] \quad \text{and} \quad \sum_{i=1}^{N} \sigma_{o_i}^2 = N \sigma_{o_i}^2,$$

and we can substitute these equivalents in $\alpha$ to obtain

$$\rho_{oo} = \frac{N (\rho_{oo})_i}{1 + (N - 1)(\rho_{oo})_i} \qquad (12)$$
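
The equivalence can be checked numerically for the idealized case of N parallel parts. The sketch below assumes equal part variances and a common intercorrelation equal to the part reliability; both function names are introduced here for illustration only.

```python
def spearman_brown(part_reliability, n_parts):
    """Reliability of a test lengthened n_parts times (Spearman-Brown)."""
    return n_parts * part_reliability / (1 + (n_parts - 1) * part_reliability)


def alpha_for_parallel_parts(part_reliability, n_parts, part_variance=1.0):
    """Coefficient alpha when all parts share one variance and one correlation."""
    # Total variance = sum of part variances + sum of all pairwise covariances.
    total_variance = (
        n_parts * part_variance
        + n_parts * (n_parts - 1) * part_reliability * part_variance
    )
    sum_part_variances = n_parts * part_variance
    return (n_parts / (n_parts - 1)) * (1 - sum_part_variances / total_variance)


for n in (2, 5, 10):
    for r in (0.1, 0.5, 0.9):
        print(n, r, spearman_brown(r, n), alpha_for_parallel_parts(r, n))
```

Under these assumptions the two columns agree for every combination, whatever common part variance is chosen.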




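Relation of the Kuder-Richardson Formula 20 to the
Kuder-Richardson Formula 21

Assume that the conditions $\overline{(\sigma_{\bar{O}_{ij}}^2)_j} = \sigma_{\bar{\bar{O}}_i}^2$ are met. Then, when
individual test items are considered,

$$\rho_{oo} = \frac{N}{N - 1} \left[ 1 - \frac{\sum_{i=1}^{N} \bar{\bar{O}}_i (1 - \bar{\bar{O}}_i)}{\sigma_o^2} \right]$$

where $\bar{\bar{O}}_i$, the mean of $\bar{O}_{ij}$ over persons, is the proportion of correct
responses to item i. Expanding this expression and reducing, we obtain

$$\rho_{oo} = \frac{N}{N - 1} \left[ 1 - \frac{\mu_o (N - \mu_o) - N^2 \sigma_{\bar{\bar{O}}_i}^2}{N \sigma_o^2} \right] \qquad (13)$$

where $\mu_o$ is the mean of observed scores over all K persons and all H
repeated measurements. When $\sigma_{\bar{\bar{O}}_i}^2 = 0$, this result is equivalent to the
Kuder-Richardson formula 21. In other words, the condition required for
the K.-R. formula 20 to equal the reliability of the test is that
$\overline{(\sigma_{\bar{O}_{ij}}^2)_j} = \sigma_{\bar{\bar{O}}_i}^2$, and the condition required for K.-R. 21 is that
$\overline{(\sigma_{\bar{O}_{ij}}^2)_j} = \sigma_{\bar{\bar{O}}_i}^2 = 0$.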



Derivation of Coefficient Alpha from 

Item Sampling Model 



An interpretation of the Kuder-Richardson formula 21 can be 
based on the model presented by Lord (1955) for the sampling of test 
items from a population of items. Lord demonstrated that the square of 
an individual's standard error of measurement is estimated by 

        $\dfrac{O_j (N - O_j)}{N - 1}$

and that when that quantity is averaged over all persons and substituted 
in the classical equation $\rho_{oo} = 1 - \dfrac{\sigma_e^2}{\sigma_o^2}$, the K.-R. formula 21 is obtained. The 
K.-R. formula 21, in other words, can be interpreted as an equation for 
test reliability for the case in which the observed score distribution 
over repeated measurements is identified with variations in score re- 
sulting from item sampling. 
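The step from the averaged error estimate to formula 21 can be checked with a small numerical sketch. The Python code below is illustrative only: the score list is hypothetical, the function names are assumptions, and the observed-score variance is computed with divisor K (not K - 1), which is the convention under which the two expressions coincide exactly.

```python
def kr21(scores, n_items):
    """Kuder-Richardson formula 21 from total scores on an n_items test."""
    k = len(scores)
    mean = sum(scores) / k
    variance = sum((x - mean) ** 2 for x in scores) / k  # divisor K
    return n_items / (n_items - 1) * (
        1 - mean * (n_items - mean) / (n_items * variance)
    )


def reliability_from_item_sampling(scores, n_items):
    """1 - (average per-person error variance) / (observed-score variance)."""
    k = len(scores)
    mean = sum(scores) / k
    variance = sum((x - mean) ** 2 for x in scores) / k  # divisor K
    # Per-person error variance estimate x(N - x)/(N - 1), averaged over persons.
    mean_error_variance = sum(
        x * (n_items - x) / (n_items - 1) for x in scores
    ) / k
    return 1 - mean_error_variance / variance


scores = [3, 5, 6, 7, 7, 8, 9, 10]  # hypothetical scores on a 10-item test
print(kr21(scores, 10), reliability_from_item_sampling(scores, 10))
```

For any set of scores with nonzero variance the two functions should return the same value up to floating-point error.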

It should be noted that the above requirement that 
$(\sigma^2_{\bar{O}_{ij}})_j = \sigma^2_{\bar{O}_i} = 0$ is implicit in this model. From the concept of 
item sampling employed by Lord, it follows that, as H increases without 
limit, for any j, $\bar{O}_{ij} = \bar{O}_j$ for all i, such that $(\sigma^2_{\bar{O}_{ij}})_j$ becomes zero 
for all j. Further, $\bar{O}_i$ becomes the same for all i, such that $\sigma^2_{\bar{O}_i} = 0$. 
It should be noted that K.-R. 20 would be equally applicable to this 
model. 



Let us assume now that the N items of a test are drawn from N 
distinct populations and that the values of $\bar{O}_{ij}$ for person j may differ. 
Then, the distribution of observed scores over repeated measurements is 
given by a Poisson sequence of trials, 

        $(\sigma_o^2)_j = N\bar{O}_j (1 - \bar{O}_j) - N(\sigma^2_{\bar{O}_{ij}})_j$

Averaging this quantity over all K persons, substituting the result in 
equation (4), and simplifying we obtain (11). 
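The variance of a Poisson sequence of trials (independent trials with unequal success probabilities) can be written either as a direct sum over items or in terms of the mean and variance of the item probabilities. The following Python sketch illustrates that identity for one hypothetical person; the probabilities and function names are assumptions made for the example.

```python
def poisson_trials_variance(probs):
    """Variance of the number of successes in independent unequal trials."""
    return sum(p * (1 - p) for p in probs)


def variance_from_mean_and_spread(probs):
    """Same variance written as N*m*(1 - m) - N*var(p)."""
    n = len(probs)
    mean_p = sum(probs) / n
    var_p = sum((p - mean_p) ** 2 for p in probs) / n  # spread across items
    return n * mean_p * (1 - mean_p) - n * var_p


# Hypothetical success probabilities of one person on N = 5 items.
probs = [0.2, 0.5, 0.5, 0.7, 0.9]
direct = poisson_trials_variance(probs)
via_identity = variance_from_mean_and_spread(probs)
print(direct, via_identity)
```

When all item probabilities are equal the spread term vanishes and the expression reduces to the ordinary binomial variance.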






In other words, equation (11) gives test reliability for the 
case in which the N items of a test are sampled from N distinct popu- 
lations of items. The K.-R. formula 20 gives test reliability only if 
$(\sigma^2_{\bar{O}_{ij}})_j$ and $\sigma^2_{\bar{O}_i}$ are equal. Finally, the Kuder-Richardson formula 21 
gives test reliability only if both these quantities are zero. 
Part III: Dependence of Reliability of Multiple-Choice Tests Upon 
Number of Choices Per Item: Prediction From the Spearman- 
Brown Formula 

It has been known for some time that the reliability of multiple- 
choice tests is influenced by the number of choices per item (Remmers, 
Karslake & Gage, 1940; Lord, 1944; Carroll, 1945; Plumlee, 1952). Since 
the probability of chance success on an item is $\frac{1}{a}$, where a is the number 
of choices per item, it is to be expected that error variance introduced 
by chance success is a decreasing function of number of choices and test 
reliability is an increasing function of number of choices. 

Remmers and his associates suggested the relationship could be 
described by the Spearman-Brown formula, which is known to indicate 
increase in reliability with increase in test length. The formula is 

(14)    $r_{noo} = \dfrac{n\,r_{oo}}{1 + (n - 1)\,r_{oo}}$

where $r_{oo}$ is the original reliability, $r_{noo}$ is the reliability of the 
test of increased length, and n is the number of times the test is 
increased in length. Remmers showed empirically that the reliability 
of various tests is approximated by this function, when n refers to 
increase in number of choices instead of test length. It has been 
pointed out, however, that there is no theoretical basis for predicting 
this result (Lord, 1944; Guilford, 1950; Gulliksen, 1950). 
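As a minimal illustration of equation (14), the following Python sketch evaluates the Spearman-Brown formula. The function name and the example values (an original reliability of .50 and a doubling of length) are assumptions chosen for the illustration, not figures from this research.

```python
def spearman_brown(r_oo, n):
    """Equation (14): reliability after the test is lengthened n times."""
    return n * r_oo / (1 + (n - 1) * r_oo)


unchanged = spearman_brown(0.50, 1)  # n = 1 leaves reliability as it was
doubled = spearman_brown(0.50, 2)    # hypothetical test doubled in length
print(unchanged, doubled)
```

The argument n need not be an integer, which is what allows a ratio of numbers of choices to be substituted for it in the manner Remmers suggested.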

Computer Simulated Results 

In a previous paper (Zimmerman & Williams, 1965) a computer pro- 
gram was used to simulate guessing error in multiple-choice tests. 
Distributions of assumed true scores were prepared, and error scores 
were generated on the basis of the probabilities to be expected from 
chance success due to guessing. The error scores were added to true 
scores to obtain observed scores. Finally, product-moment correlations 
between different sets of observed scores obtained by repeating the 
procedure several times gave an indication of test reliability. 
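The procedure described above can be sketched in a few lines of Python. This is not the original IBM 1620 program; it is a minimal sketch under assumptions that are labeled in the comments: true scores are taken to be the number of items a person knows, drawn here from a hypothetical uniform distribution, and each remaining item is answered correctly by chance with probability 1/a. The seed, sample size, and function names are likewise illustrative.

```python
import random


def simulate_form(true_scores, n_items, n_choices, rng):
    """One administration: every unknown item is guessed with chance 1/a."""
    observed = []
    for t in true_scores:
        lucky = sum(rng.random() < 1.0 / n_choices for _ in range(n_items - t))
        observed.append(t + lucky)  # observed score = true score + error score
    return observed


def pearson(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1620)  # fixed seed so the sketch is repeatable
n_items, n_choices = 10, 2
# Assumed true-score distribution: uniform over 0..N for 5000 persons.
true_scores = [rng.randint(0, n_items) for _ in range(5000)]
form_a = simulate_form(true_scores, n_items, n_choices, rng)
form_b = simulate_form(true_scores, n_items, n_choices, rng)
reliability = pearson(form_a, form_b)
print(round(reliability, 2))
```

Repeating the two calls to `simulate_form` with the same true scores plays the role of parallel forms; the correlation between the resulting observed scores is the simulated reliability for this assumed true-score distribution.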

The results of this procedure for tests differing in length and 
number of choices are shown in Table 1. The data in this table can be 
used to examine the effect of increased test length, as well as increased 
number of choices, upon reliability. Apparently, there is an inter- 
action between the effects of test length and number of choices. 
TABLE 1 

COMPUTER SIMULATED RESULTS FOR RELIABILITY 

               N = 10    N = 10    N = 100   N = 100 
               a = 2     a = 5     a = 2     a = 5 

r_oo *          .44       .74       .89       .97 
r_oo **         --        --        .89       .97 
r_oo ***        --        .76       --        .97 
r_oo ****       --        .66       --        .95 

*Reliability given by computer program. 
**Reliability given by substituting .44 or .74 in Equation 14. 
***Reliability given by substituting .44 or .89 in Equation 18. 
****Reliability given by substituting .44 or .89 in Equation 14, 
where n = 2.5. 

For short tests (N = 10) reliability increases greatly with increase in 
number of choices (.44 to .74). For long tests (N = 100) reliability 
increases slightly with number of choices (.89 to .97). Also, for 2 
choices, reliability increases greatly with test length (.44 to .89). 
And for 5 choices reliability increases to a lesser degree with test 
length (.74 to .97). 

From the table it is seen that the Spearman-Brown formula describes 
the increase in reliability with increase in test length for both the 
2-choice test and the 5-choice test (Zimmerman & Williams, 1966). 
Consider, now, Remmers' suggestion that the same formula describes 
increase in reliability with increase in number of choices. The results 
in the table show that there is a greater discrepancy, although the 
predicted value for the longer test is close to that indicated by the 
program. 



Increased Reliability As a Function of 
Increased Number of Choices 



It is possible to derive a simple equation showing the effect of 
increasing the number of choices upon reliability for the case in which 
only error due to guessing is present. Reliability is given by 

(15)    $r_{oo} = \dfrac{(a - 1)\,s_t^2}{(a - 1)\,s_t^2 + N - \bar{T}}$

where a is the number of choices, $s_t^2$ is the variance of true scores, 
N is the number of items, and $\bar{T}$ is the mean of true scores. This 
equation gives the value which is approximated by the computer simu- 
lation method described above (Burkheimer, 1965; Burkheimer, Zimmerman, 
& Williams, 1967). When the number of choices is increased, we can 
write 



(16)    $r'_{oo} = \dfrac{(a' - 1)\,s_t^2}{(a' - 1)\,s_t^2 + N - \bar{T}}$

where $r'_{oo}$ is the reliability for the test with increased number of 
choices, a is the original number of choices, a' is the increased num- 
ber of choices, and the other symbols are as defined above. Solving 
(15) for $s_t^2$ gives 

(17)    $s_t^2 = \dfrac{r_{oo}\,(N - \bar{T})}{(a - 1)(1 - r_{oo})}$

Substituting this result in (16) and simplifying, we have 

(18)    $r'_{oo} = \dfrac{(a' - 1)\,r_{oo}}{(a - 1) + (a' - a)\,r_{oo}}$

The data presented in Table 1 show that substitution in this equation 
yields results close to those indicated by the computer program. The 
accuracy is greater than that obtained by using (14) and of the same 
order as that obtained by using (14) for increased test length. 
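The relationships among equations (14), (15), and (18) can be sketched in Python. In the code below the function names and the true-score summary (N = 10, mean 5, variance 4) are hypothetical values chosen for illustration; the reliabilities .44 and .89 are the computer-program values for the 2-choice tests in Table 1.

```python
def reliability_guessing(a, true_var, n_items, true_mean):
    """Equation (15): reliability when only guessing error is present."""
    return (a - 1) * true_var / ((a - 1) * true_var + n_items - true_mean)


def reliability_more_choices(r_oo, a, a_new):
    """Equation (18): reliability after changing a choices to a_new."""
    return (a_new - 1) * r_oo / ((a - 1) + (a_new - a) * r_oo)


def spearman_brown(r_oo, n):
    """Equation (14)."""
    return n * r_oo / (1 + (n - 1) * r_oo)


# Hypothetical true-score summary, used only to check (18) against (15).
n_items, true_mean, true_var = 10, 5.0, 4.0
r_two = reliability_guessing(2, true_var, n_items, true_mean)
r_five_direct = reliability_guessing(5, true_var, n_items, true_mean)
r_five_via_18 = reliability_more_choices(r_two, 2, 5)

# Table 1 starting values: (18) versus (14) with n = 5/2 = 2.5.
predictions = [
    (r, round(reliability_more_choices(r, 2, 5), 2), round(spearman_brown(r, 2.5), 2))
    for r in (0.44, 0.89)
]
print(r_five_direct, r_five_via_18, predictions)
```

The first comparison shows that (18) needs only the original reliability and the two numbers of choices, not the true-score mean or variance. The second reproduces the third and fourth rows of Table 1.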

If the method employed by Remmers were valid, the ratio $\frac{a'}{a}$ would 
be comparable to n in (14), which could be written in this form: 

(19)    $r'_{oo} = \dfrac{\frac{a'}{a}\,r_{oo}}{1 + \left(\frac{a'}{a} - 1\right) r_{oo}}$

Simplifying, we obtain the following result 

(20)    $r'_{oo} = \dfrac{a'\,r_{oo}}{a + (a' - a)\,r_{oo}}$

which can be compared to (18). It is seen, therefore, that equation 
(18) differs from the modification of the Spearman-Brown formula 
suggested by Remmers only by subtraction of 1 from the factor in the 
numerator and the a term in the denominator. If both a' and a were 
large, (14) and (18) would give nearly the same results. For multiple- 
choice tests, however, a' and a are relatively small, and some discrep- 
ancy can be expected. 

Dividing both numerator and denominator of (18) by a - 1 gives 

(21)    $r'_{oo} = \dfrac{\frac{(a' - 1)}{(a - 1)}\,r_{oo}}{\frac{(a - 1)}{(a - 1)} + \frac{(a' - a)}{(a - 1)}\,r_{oo}}$

If, now, we define A as the ratio $\frac{(a' - 1)}{(a - 1)}$ and simplify, we have 

(22)    $r'_{oo} = \dfrac{A\,r_{oo}}{1 + (A - 1)\,r_{oo}}$

which has the same form as the Spearman-Brown formula. In other words, 
Remmers' suggestion is valid if we employ the ratio $\frac{(a' - 1)}{(a - 1)}$ in the 
Spearman-Brown formula, but not if we employ the ratio $\frac{a'}{a}$. It should 
be noted that the above equations apply only to the case in which 
differences in reliability result from chance success due to guessing. 
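A brief Python sketch can make the distinction between the two ratios concrete. The three test cases below are hypothetical combinations of original reliability and numbers of choices, and the function names are assumptions; the sketch checks that the Spearman-Brown formula with n = A reproduces (18), whereas n = a'/a gives a smaller value.

```python
def spearman_brown(r_oo, n):
    """Equation (14)."""
    return n * r_oo / (1 + (n - 1) * r_oo)


def reliability_more_choices(r_oo, a, a_new):
    """Equation (18)."""
    return (a_new - 1) * r_oo / ((a - 1) + (a_new - a) * r_oo)


def choice_ratio(a, a_new):
    """The ratio A = (a' - 1)/(a - 1) used in equation (22)."""
    return (a_new - 1) / (a - 1)


# Hypothetical cases: (original reliability, a, a').
cases = [(0.44, 2, 5), (0.60, 3, 4), (0.80, 4, 5)]
gaps = []
for r, a, a_new in cases:
    via_ratio_A = spearman_brown(r, choice_ratio(a, a_new))
    via_18 = reliability_more_choices(r, a, a_new)
    via_simple_ratio = spearman_brown(r, a_new / a)
    # (difference between (22) and (18), shortfall of the a'/a version)
    gaps.append((abs(via_ratio_A - via_18), via_18 - via_simple_ratio))
print(gaps)
```

In each case the first entry is zero to rounding error and the second is positive, since A exceeds a'/a whenever a' is greater than a.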

Dependence of Correlation Between Error Scores On 
Parallel Forms Upon Number of Choices 

It is of interest that an equation showing the dependence of the 
correlation between error scores on parallel forms of a test upon num- 
ber of choices can also be derived. This quantity has been assumed to 
be zero in the classical theory of mental tests. However, when chance 
success due to guessing is present, as in the case of most multiple- 
choice tests, it can be shown that it is positive in value, that it 
decreases with number of choices, and that the relationship is indicated 
by an equation similar to (18) . 

Correlation between error scores on parallel forms is in fact given 
by the following equation: 

(23)    $r_{ee} = \dfrac{s_t^2}{s_t^2 + (a - 1)(N - \bar{T})}$



where the symbols are as defined above (Burkheimer, 1965; Burkheimer, 
Zimmerman & Williams, 1967). When number of choices is increased, we 
can write 



(24)    $r'_{ee} = \dfrac{s_t^2}{s_t^2 + (a' - 1)(N - \bar{T})}$

Solving (23) for $s_t^2$ gives 

(25)    $s_t^2 = \dfrac{r_{ee}\,(a - 1)(N - \bar{T})}{(1 - r_{ee})}$



Substituting (25) in (24) and simplifying leads to this result: 

(26)    $r'_{ee} = \dfrac{(a - 1)\,r_{ee}}{(a' - 1) - (a' - a)\,r_{ee}}$

Dividing both numerator and denominator of (26) by a' - 1 gives 

(27)    $r'_{ee} = \dfrac{\frac{(a - 1)}{(a' - 1)}\,r_{ee}}{\frac{(a' - 1)}{(a' - 1)} - \frac{(a' - a)}{(a' - 1)}\,r_{ee}}$

If we define $B = \frac{1}{A} = \frac{(a - 1)}{(a' - 1)}$ and simplify, we have 

(28)    $r'_{ee} = \dfrac{B\,r_{ee}}{1 + (B - 1)\,r_{ee}}$

which, again, has the same form as the Spearman-Brown formula. There 
exists no analogue of this equation in the classical theory of mental 
tests. From (26) and (28) it is clear that the degree of correlation 
between error scores on parallel forms decreases with an increase in 
the number of choices. 
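The chain from (23) to (28) can be checked with a short Python sketch. The function names and the true-score summary (N = 10, mean 5, variance 4) are hypothetical; the value .46 is the computer-program result for the 10-item, 2-choice test in Table 2.

```python
def error_correlation(a, true_var, n_items, true_mean):
    """Equation (23): correlation between error scores on parallel forms."""
    return true_var / (true_var + (a - 1) * (n_items - true_mean))


def error_correlation_more_choices(r_ee, a, a_new):
    """Equation (26)."""
    return (a - 1) * r_ee / ((a_new - 1) - (a_new - a) * r_ee)


def error_correlation_sb_form(r_ee, a, a_new):
    """Equation (28), with B = (a - 1)/(a' - 1)."""
    b = (a - 1) / (a_new - 1)
    return b * r_ee / (1 + (b - 1) * r_ee)


# Hypothetical true-score summary, used only to compare the equations.
n_items, true_mean, true_var = 10, 5.0, 4.0
r_ee_two = error_correlation(2, true_var, n_items, true_mean)
r_ee_five_direct = error_correlation(5, true_var, n_items, true_mean)
r_ee_five_via_26 = error_correlation_more_choices(r_ee_two, 2, 5)
r_ee_five_via_28 = error_correlation_sb_form(r_ee_two, 2, 5)

# Table 2 starting value for the 10-item test.
table_check = round(error_correlation_more_choices(0.46, 2, 5), 2)
print(r_ee_five_direct, r_ee_five_via_26, r_ee_five_via_28, table_check)
```

The three routes to the 5-choice value agree, the value is lower than for 2 choices, and substituting .46 gives the .18 entered in Table 2.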

The results given by the computer program for $r_{ee}$ are shown in 
Table 2. Equation (26) predicts accurately the effect of increasing 
number of choices upon $r_{ee}$. Another fact of interest shown in the 
table is that, if $r_{ee}$ is treated as a reliability coefficient, the 
Spearman-Brown formula indicates accurately the change in its value 
with change in test length (Zimmerman & Williams, 1966). For longer 
tests the correlation between error scores on parallel forms becomes 
higher in value, and the degree of change is indicated by the Spearman- 
Brown formula. 



TABLE 2 

COMPUTER SIMULATED RESULTS FOR CORRELATION 
BETWEEN ERROR SCORES ON PARALLEL FORMS 

               N = 10    N = 10    N = 100   N = 100 
               a = 2     a = 5     a = 2     a = 5 

r_ee *          .46       .17       .89       .65 
r_ee **         --        --        .90       .67 
r_ee ***        --        .18       --        .66 

*Value given by computer program. 
**Value given by substituting .46 or .17 in Equation (14). 
***Value given by substituting .46 or .89 in Equation (26). 
CONCLUSIONS 



The equations derived as a result of this research are theoretical in 
nature. Even using the model which does not introduce the theoretical 
concepts of true score and error components, it is necessary to rely on a 
theoretical distribution of observed scores that would be obtained if one 
were able to measure individuals repeatedly on an extremely large number 
of parallel forms of some test. Further, the validation procedure used is 
a gross simplification of the empirical testing situation. No matter how 
complex a computer simulation technique - in this research the complexity 
was greatly limited by the limited capacity of the computer used - it is 
highly unlikely that one could program all the determinants of human test- 
taking behavior. 

For these reasons, it is not likely that the results of this research 
will have any immediate application to the day-to-day problems of testing. 
It is interesting to note, however, that equations derived from the model 
of non-independence have been tested on the results obtained in an empirical 
situation and that the implications of the theory of non-independence were 
supported (Zimmerman, et al., 1966). Certain conclusions can be drawn, of 
course, from the results of this work that are applicable to an empirical 
situation. The classical equation relating reliability of a test as 
established for one group to the reliability that should be obtained for 
that test with another group with differing score variability should be 
used with caution when the test is of the multiple-choice type. One should 
be aware of the limiting assumptions necessary for coefficient alpha and 
related equations to be appropriate estimates of test reliability; and 
certainly the users of multiple-choice tests should be aware that tests 
with a greater number of alternatives per item are generally more reliable 
measuring instruments. 

It is also considered quite important that one have an understanding 
of the implications of the presence of this particular class of error. 
Use of multiple-choice tests is widespread, and it is precisely this type 
of test that is subject to the appearance of non-independent error due to 
chance guessing success. Since this type of error operates somewhat 
differently than the error that has traditionally been considered in test 
theory, new equations describing the effects of such error are needed. 
Such equations taken with those previously developed in classical theory 
should lead to a fuller understanding of an individual's score on a test. 

This research is a beginning toward a more general theory of mental 
tests that takes into consideration the many classes of error reflected in 
an individual's score. Hopefully it will serve as a foundation for further 
research into non-independent error operating both alone and in combination 
with other classes of error. More important, it may hopefully serve as a 
catalyst in precipitating further research. 
Richardson formula 20. The increase in test reliability with increase in number 
of alternatives per item is considered. The derived equation describing this 
relationship is expressed in a form similar to the Spearman-Brown formula for 
increase in reliability with increase in test length. 